Show the code
import pandas as pd
import numpy as np
import sqlite3
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)Course DS 250
Aidan Pfunder
Have you ever wondered what secrets lie hidden in the data behind baseball? I’ve written custom SQL queries in Python to show the standout players from BYU-Idaho, calculate batting averages, and compare average salaries between top teams like the Yankees and Red Sox.
Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.
type your results and analysis here
query_byu_idaho_salaries = """
SELECT
collegeplaying.playerID,
collegeplaying.schoolID,
salaries.salary,
salaries.yearID,
salaries.teamID
FROM collegeplaying
JOIN salaries
ON collegeplaying.playerID = salaries.playerID
WHERE collegeplaying.schoolID = 'idbyuid'
ORDER BY salaries.salary DESC;
"""
df_byu_idaho_salaries = pd.read_sql_query(query_byu_idaho_salaries, con)
print(df_byu_idaho_salaries) playerID schoolID salary yearID teamID
0 lindsma01 idbyuid 4000000.0 2014 CHA
1 lindsma01 idbyuid 4000000.0 2014 CHA
2 lindsma01 idbyuid 3600000.0 2012 BAL
3 lindsma01 idbyuid 3600000.0 2012 BAL
4 lindsma01 idbyuid 2800000.0 2011 COL
5 lindsma01 idbyuid 2800000.0 2011 COL
6 lindsma01 idbyuid 2300000.0 2013 CHA
7 lindsma01 idbyuid 2300000.0 2013 CHA
8 lindsma01 idbyuid 1625000.0 2010 HOU
9 lindsma01 idbyuid 1625000.0 2010 HOU
10 stephga01 idbyuid 1025000.0 2001 SLN
11 stephga01 idbyuid 1025000.0 2001 SLN
12 stephga01 idbyuid 900000.0 2002 SLN
13 stephga01 idbyuid 900000.0 2002 SLN
14 stephga01 idbyuid 800000.0 2003 SLN
15 stephga01 idbyuid 800000.0 2003 SLN
16 stephga01 idbyuid 550000.0 2000 SLN
17 stephga01 idbyuid 550000.0 2000 SLN
18 lindsma01 idbyuid 410000.0 2009 FLO
19 lindsma01 idbyuid 410000.0 2009 FLO
20 lindsma01 idbyuid 395000.0 2008 FLO
21 lindsma01 idbyuid 395000.0 2008 FLO
22 lindsma01 idbyuid 380000.0 2007 FLO
23 lindsma01 idbyuid 380000.0 2007 FLO
24 stephga01 idbyuid 215000.0 1999 SLN
25 stephga01 idbyuid 215000.0 1999 SLN
26 stephga01 idbyuid 185000.0 1998 PHI
27 stephga01 idbyuid 185000.0 1998 PHI
28 stephga01 idbyuid 150000.0 1997 PHI
29 stephga01 idbyuid 150000.0 1997 PHI
This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)
a. Write an SQL query that provides playerID, yearID, and batting average for players with at least 1 at bat that year. Sort the table from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.
a. Use the same query as above, but only include players with at least 10 at bats that year. Print the top 5 results.
a. Now calculate the batting average for players over their entire careers (all years combined). Only include players with at least 100 at bats, and print the top 5 results.
type your results and analysis here
playerID yearID batting_avg
0 aberal01 1957 1.0
1 abernte02 1960 1.0
2 abramge01 1923 1.0
3 acklefr01 1964 1.0
4 alanirj01 2019 1.0
playerID yearID batting_avg
0 nymanny01 1974 0.642857
1 carsoma01 2013 0.636364
2 altizda01 1910 0.600000
3 johnsde01 1975 0.600000
4 silvech01 1948 0.571429
query_career_batting_avg = """
SELECT
playerID,
SUM(H) AS total_hits,
SUM(AB) AS total_at_bats,
SUM(H)*1.0/SUM(AB) AS batting_avg
FROM batting
GROUP BY playerID
HAVING SUM(AB) >= 100
ORDER BY batting_avg DESC
LIMIT 5
"""
df_career_batting_avg = pd.read_sql_query(query_career_batting_avg, con)
print(df_career_batting_avg) playerID total_hits total_at_bats batting_avg
0 cobbty01 4189 11436 0.366299
1 barnero01 860 2391 0.359682
2 hornsro01 2930 8173 0.358497
3 jacksjo01 1772 4981 0.355752
4 meyerle01 513 1443 0.355509
Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc). Write an SQL query to get the data you need, then make a graph using Lets-Plot to visualize the comparison. What do you learn?
I compared the average salaries between the Boston Red Sox and the New York Yankees. I learned that the New York Yankees have started paying their players significantly mkore on average since 2002. Both teams have increased their average salaries over time.
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)
query_team_salary = """
SELECT
yearID,
teamID,
AVG(salary) AS avg_salary
FROM salaries
WHERE teamID IN ('NYA', 'BOS')
GROUP BY yearID, teamID
ORDER BY yearID, teamID;
"""
df_team_salary = pd.read_sql_query(query_team_salary, con)
print(df_team_salary)
plot_salary_compare = (
ggplot(df_team_salary, aes(x='yearID', y='avg_salary', fill='teamID')) +
geom_bar(stat='identity', position='dodge') +
labs(
x = "Season Year",
y = "Average Salary",
title = "Average Player Salary: Yankees vs Red Sox"
)
)
plot_salary_compare.show() yearID teamID avg_salary
0 1985 BOS 4.359024e+05
1 1985 NYA 7.119102e+05
2 1986 BOS 4.966289e+05
3 1986 NYA 6.605090e+05
4 1987 BOS 6.762778e+05
.. ... ... ...
59 2014 NYA 8.230996e+06
60 2015 BOS 5.659481e+06
61 2015 NYA 7.336274e+06
62 2016 BOS 6.501578e+06
63 2016 NYA 7.689579e+06
[64 rows x 3 columns]
Advanced Salary Distribution by Position (with Case Statement):
* Write an SQL query that provides a summary table showing the average salary for each position (e.g., pitcher, catcher, outfielder). Position information can be found in the fielding table in the POS column.
Include the following columns:
* position
* average_salary
* total_players
* highest_salary
* The highest_salary column should display the highest salary ever earned by a player in that position.
* Additionally, create a new column called salary_category using a case statement:
* If the average salary is above $3 million, categorize it as “High Salary.”
* If the average salary is between $2 million and $3 million, categorize it as “Medium Salary.”
* Otherwise, categorize it as “Low Salary.”
* Order the table by average salary in descending order.
**Hint:** Beware, it is common for a player to play multiple positions in a single year. For this analysis, each player’s salary should only be counted toward one position in a given year: the position at which they played the most games that year. This will likely require a (sub-query)[https://docs.data.world/documentation/sql/concepts/advanced/WITH.html].
This shows that first base players have the highest average salary for baseball players. It is interested that although the highest paid player is a pitcher, pitchers rank 4th according to average salaries. This may be because pitcher has the highest number of players, meaning that they cannot all be top earners.
query_salary_by_position_named = """
WITH primary_pos AS (
SELECT
playerID,
yearID,
POS AS pos,
SUM(G) AS games_played,
ROW_NUMBER() OVER (
PARTITION BY playerID, yearID
ORDER BY SUM(G) DESC
) AS rn
FROM fielding
GROUP BY playerID, yearID, POS
)
SELECT
CASE primary_pos.pos
WHEN 'P' THEN 'Pitcher'
WHEN 'C' THEN 'Catcher'
WHEN '1B' THEN 'First Base'
WHEN '2B' THEN 'Second Base'
WHEN 'SS' THEN 'Shortstop'
WHEN '3B' THEN 'Third Base'
WHEN 'LF' THEN 'Left Field'
WHEN 'CF' THEN 'Center Field'
WHEN 'RF' THEN 'Right Field'
WHEN 'DH' THEN 'Designated Hitter'
ELSE primary_pos.pos
END AS position,
AVG(salaries.salary) AS average_salary,
COUNT(DISTINCT salaries.playerID) AS total_players,
MAX(salaries.salary) AS highest_salary,
CASE
WHEN AVG(salaries.salary) > 3000000 THEN 'High Salary'
WHEN AVG(salaries.salary) >= 2000000 THEN 'Medium Salary'
ELSE 'Low Salary'
END AS salary_category
FROM salaries
JOIN primary_pos
ON salaries.playerID = primary_pos.playerID
AND salaries.yearID = primary_pos.yearID
WHERE primary_pos.rn = 1
GROUP BY position
ORDER BY average_salary DESC;
"""
df_salary_by_position_named = pd.read_sql_query(query_salary_by_position_named, con)
print(df_salary_by_position_named) position average_salary total_players highest_salary salary_category
0 First Base 3.380780e+06 454 28000000.0 High Salary
1 OF 2.396919e+06 1128 27328046.0 Medium Salary
2 Third Base 2.323326e+06 490 33000000.0 Medium Salary
3 Shortstop 1.973604e+06 377 22600000.0 Low Salary
4 Pitcher 1.938131e+06 2556 33000000.0 Low Salary
5 Second Base 1.821074e+06 478 24000000.0 Low Salary
6 Catcher 1.430338e+06 403 23000000.0 Low Salary
Advanced Career Longevity and Performance (with Subqueries):
* Calculate the average career length (in years) for players who have played at least **10 games**. Then, identify the top 10 players with the longest careers (based on the number of years they played). Include their:
* playerID
* first_name
* last_name
* career_length
* The career_length should be calculated as the difference between the maximum and minimum yearID for each player.
The top ten longest carreers for baseball players is dominated by older players. Of the ten, six of them began their careers in the 1800s with two more beginning before 1904. The player with the logest career is Jim O’Rourke with 32 years, who began his career in 1872. This could be a sign that modern players do not last as long in their career due to more intense play or stricter work out regiments.
query_career_longevity_with_start = """
WITH career_stats AS (
SELECT
playerID,
MIN(yearID) AS first_year,
MAX(yearID) AS last_year,
(MAX(yearID) - MIN(yearID)) AS career_length,
SUM(G) AS total_games
FROM fielding
GROUP BY playerID
HAVING SUM(G) >= 10
)
SELECT
cs.playerID,
p.nameFirst AS first_name,
p.nameLast AS last_name,
cs.first_year AS start_year,
cs.career_length
FROM career_stats AS cs
JOIN people AS p
ON cs.playerID = p.playerID
ORDER BY cs.career_length DESC
LIMIT 10;
"""
df_career_longevity_with_start = pd.read_sql_query(query_career_longevity_with_start, con)
print(df_career_longevity_with_start) playerID first_name last_name start_year career_length
0 orourji01 Jim O'Rourke 1872 32
1 altroni01 Nick Altrock 1898 31
2 lathaar01 Arlie Latham 1880 29
3 mcguide01 Deacon McGuire 1884 28
4 eversjo01 Johnny Evers 1902 27
5 jennihu01 Hughie Jennings 1891 27
6 ryanno01 Nolan Ryan 1966 27
7 streega01 Gabby Street 1904 27
8 ansonca01 Cap Anson 1871 26
9 johnto01 Tommy John 1963 26